141 research outputs found

    Solving the challenges of concept drift in data stream classification.

    The rise of network-connected devices and applications has led to a significant increase in the volume of data that are continuously generated over time, called data streams. In real-world applications, storing the entirety of a data stream for later analysis is often not practical, due to the data stream’s potentially infinite volume. Data stream mining techniques and frameworks are therefore created to analyze streaming data as they arrive. However, compared to traditional data mining, data stream mining faces unique challenges that stem from the high arrival rate of data streams and their dynamic nature. In this dissertation, an array of techniques and frameworks is presented to improve solutions to some of these challenges. First, this dissertation acknowledges that a “no free lunch” theorem exists for data stream mining: no silver-bullet solution can solve all of its problems. The dissertation focuses on detecting changes in data distribution in data stream mining. These changes are called concept drift. Concept drift can be categorized into many types. A detection algorithm often works only on some types of drift, but not all of them. Because of this, the dissertation finds specific techniques to solve specific challenges, instead of looking for a general solution. Then, this dissertation considers improving solutions for the challenges posed by the high arrival rate of data streams. Data stream mining frameworks often need to process vast amounts of data samples in limited time. Some data mining activities, notably data sample labeling for classification, are too costly or too slow at such a large scale. This dissertation presents two techniques that reduce the amount of labeling needed for data stream classification. The first technique is a grid-based label selection process that applies to highly imbalanced data streams. Such data streams have one class of data samples that vastly outnumbers another class. 
Many majority class samples need to be labeled before a minority class sample can be found due to the imbalance. The presented technique divides the data samples into groups, called grids, and actively searches for minority class samples that are close by within a grid. Experiment results show the technique can reduce the total number of data samples that need to be labeled. The second technique is a smart preprocessing technique that reduces the number of times a new learning model needs to be trained due to concept drift. Less model training means fewer data labels are required, and thus lower cost. Experiment results show that in some cases the reduced performance of learning models is the result of improper preprocessing of the data, not of concept drift. By adapting preprocessing to the changes in data streams, models can retain high performance without retraining. Acknowledging the high cost of labeling, the dissertation then considers the scenario where labels are unavailable when needed. The framework Sliding Reservoir Approach for Delayed Labeling (SRADL) is presented to explore solutions to this problem. SRADL tries to solve the delayed labeling problem, where concept drift occurs and no labels are immediately available. SRADL uses semi-supervised learning, employing a sliding-window approach to store historical data, which is combined with new unlabeled data to train new models. Experiments show that SRADL performs well in some cases of delayed labeling. Next, the dissertation considers improving solutions for the challenge of dynamism within data streams, most notably concept drift. The complex nature of concept drift means that most existing detection algorithms can only detect limited types of concept drift. To detect more types of concept drift, an ensemble approach that employs various algorithms, called Heuristic Ensemble Framework for Concept Drift Detection (HEFDD), is presented. 
The occurrence of each type of concept drift is voted on by the detection results of each algorithm in the ensemble. Types of concept drift with votes past majority are then declared detected. Experiment results show that HEFDD is able to improve detection accuracy significantly while reducing false positives. With the ability to detect various types of concept drift provided by HEFDD, the dissertation tries to improve the delayed labeling framework SRADL. A new combined framework, SRADL-HEFDD, is presented, which produces synthetic labels to handle the unavailability of labels from human experts. SRADL-HEFDD employs different synthetic labeling techniques based on the different types of drift detected by HEFDD. Experimental results show that, compared to the default SRADL, the combined framework improves prediction performance when only a small number of labeled samples is available. Finally, as machine learning applications are increasingly used in critical domains such as medical diagnostics, the accountability, explainability and interpretability of machine learning algorithms need to be considered. Explainable machine learning aims to use a white-box approach for data analytics, which enables learning models to be explained and interpreted by human users. However, few studies have been done on explaining what has changed in a dynamic data stream environment. This dissertation thus presents the Data Stream Explainability (DSE) framework. DSE visualizes changes in data distribution and model classification boundaries between chunks of streaming data. The visualizations can then be used by a data mining researcher to generate explanations of what has changed within the data stream. To show that DSE can help average users understand data stream mining better, a survey was conducted with an expert group and a non-expert group of users. Results show DSE can reduce the gap between the two groups in understanding what changed in data stream mining.
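
The voting rule described for HEFDD in the abstract above (each detector votes on drift types, and a type is declared once its votes pass a majority) can be illustrated with a minimal sketch. This is not the dissertation's implementation: the function name `vote_on_drift_types`, the drift-type strings, and the representation of each detector's result as a set of reported types are all assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, Set


def vote_on_drift_types(detector_outputs: Iterable[Set[str]]) -> Set[str]:
    """Return the drift types reported by a strict majority of detectors.

    Each element of detector_outputs is the set of drift types one
    (hypothetical) detector reports for the current chunk of the stream.
    """
    outputs = list(detector_outputs)
    # One vote per detector per drift type it reports.
    votes = Counter(drift_type for out in outputs for drift_type in out)
    # "Past majority" is read here as strictly more than half of the detectors.
    threshold = len(outputs) / 2
    return {drift_type for drift_type, n in votes.items() if n > threshold}


# Three hypothetical detectors; type names are illustrative only.
detected = vote_on_drift_types([
    {"sudden"},
    {"sudden", "gradual"},
    {"gradual", "recurring"},
])
```

With three detectors, a type needs at least two votes, so `detected` holds "sudden" and "gradual" but not "recurring". How HEFDD weights detectors or breaks ties is not stated in the abstract, so the sketch gives every detector one equal vote.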

    Sliding Reservoir Approach for Delayed Labeling in Streaming Data Classification

    When concept drift occurs within streaming data, a streaming data classification framework needs to update the learning model to maintain its performance. Labeled samples required for training a new model are often not immediately available in real-world applications. This delay of labels might negatively impact the performance of traditional streaming data classification frameworks. To solve this problem, we propose the Sliding Reservoir Approach for Delayed Labeling (SRADL). By combining chunk-based semi-supervised learning with a novel approach to managing labeled data, SRADL does not need to wait for the labeling process to finish before updating the learning model. Experiments with two delayed-label scenarios show that SRADL improves prediction performance over the naïve approach by as much as 7.5% in certain cases. The largest gain occurs with an 18-chunk labeling delay under the continuous label delivery scenario in experiments on real-world data.
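
The general idea of keeping a bounded store of labeled history and combining it with a fresh unlabeled chunk for semi-supervised training can be sketched as follows. The abstract does not describe how SRADL actually manages its reservoir, so this is only a generic sliding-window buffer under stated assumptions: the class name `SlidingReservoir`, its methods, and the use of -1 as the "unlabeled" marker are all hypothetical choices for illustration.

```python
from collections import deque
from typing import Deque, List, Sequence, Tuple

UNLABELED = -1  # assumed placeholder label for samples whose label has not arrived

Sample = Sequence[float]


class SlidingReservoir:
    """Bounded store of labeled samples; the oldest entries are evicted first."""

    def __init__(self, capacity: int) -> None:
        self._labeled: Deque[Tuple[Sample, int]] = deque(maxlen=capacity)

    def add_labeled(self, samples: Sequence[Sample], labels: Sequence[int]) -> None:
        # Called whenever delayed labels finally arrive for earlier samples.
        for x, y in zip(samples, labels):
            self._labeled.append((x, y))

    def training_set(self, unlabeled_chunk: Sequence[Sample]) -> Tuple[List[Sample], List[int]]:
        # Labeled history first, then the new chunk marked as unlabeled.
        features = [x for x, _ in self._labeled] + list(unlabeled_chunk)
        labels = [y for _, y in self._labeled] + [UNLABELED] * len(unlabeled_chunk)
        return features, labels


reservoir = SlidingReservoir(capacity=3)
reservoir.add_labeled([[0.1], [0.2], [0.3], [0.4]], [0, 1, 0, 1])
X, y = reservoir.training_set([[0.5], [0.6]])
```

Here the reservoir keeps only the three most recent labeled samples, so `y` is `[1, 0, 1, -1, -1]`. The resulting pair could be passed to any semi-supervised learner that accepts a marker value for missing labels, so model updates need not wait for the new chunk to be labeled.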

    Evolution of Business Intelligence: An Analysis from the Perspective of Social Network

    Using CiteSpace, Pajek and other software, this paper presents a visual analysis of the knowledge graph of the Business Intelligence literature and explores the future development trend of business intelligence. Taking the core periodicals of CNKI as the data source, keywords are extracted and analyzed with the help of this software. A total of 2938 articles were published from 2006 to 2020, and the number of articles published over these 15 years has gradually levelled off. Among the 607 researchers, Yang Bingru is the most representative; among the 424 journals, Journal of Information ranks first; and among the 787 keywords, "data mining" is the most frequently used. Our country still needs in-depth research in the field of business intelligence. The knowledge graph shows directly that big data and machine learning are the frontier hot spots of future development, which provides research directions for researchers.

    Time-Inconsistent Stochastic Linear--Quadratic Control

    In this paper, we formulate a general time-inconsistent stochastic linear-quadratic (LQ) control problem. The time-inconsistency arises from the presence of a quadratic term of the expected state as well as a state-dependent term in the objective functional. We define an equilibrium, instead of optimal, solution within the class of open-loop controls, and derive a sufficient condition for equilibrium controls via a flow of forward-backward stochastic differential equations. When the state is one dimensional and the coefficients in the problem are all deterministic, we find an explicit equilibrium control. As an application, we then consider a mean-variance portfolio selection model in a complete financial market where the risk-free rate is a deterministic function of time but all the other market parameters are possibly stochastic processes. Applying the general sufficient condition, we obtain explicit equilibrium strategies both when the risk premium is deterministic and when it is stochastic. Comment: 24 pages. To be submitted to SICO

    Improving the indoor thermal environment with ceiling radiant terminals

    A CFD (Computational Fluid Dynamics) simulation model of a porous ceiling radiant air-conditioning system was established to study the influence of the ceiling temperature and envelope temperature (including the temperature of the walls and the floor of a room) on the thermal environment in a room equipped with such a system. The results showed that, for the summer condition, higher ceiling temperatures would result in a higher indoor air temperature and a higher Predicted Percentage Dissatisfied (PPD), which meant potential discomfort for occupants of the room. For the winter condition, however, a higher ceiling temperature of up to 28°C would result in a lower PPD, thus improving thermal comfort. Taking energy conservation into account, thermal comfort could be assured if the ceiling temperature was not more than 28°C. As for the effect of envelope temperature, the results showed that an increase in the envelope temperature during summer could result in a higher indoor air temperature, but the thermal comfort of occupants could still be ensured under such conditions. Considering both thermal comfort and energy conservation, a ceiling temperature of 18°C (underside surface temperature of the ceiling) and an envelope temperature between 26°C and 32°C proved appropriate for the summer. Similarly, based on the simulation results, a ceiling temperature of 26°C and an envelope temperature between 8°C and 11°C were found appropriate for the winter. The results indicated that, for the porous ceiling radiant air-conditioning system, the ceiling temperature should be controlled to increase the ratio of radiant heat transfer in the summer, and the envelope temperature should be lowered to improve the energy conservation of the system. In the winter, heat transfer by radiation from the porous ceiling would account for a larger ratio, so the system showed good heating capacity and energy-conservation performance in winter.